feat: add direct remote access over s3 and https via warcio >= 1.8.0 #25
handecelikkanat wants to merge 6 commits into main
This seems unnecessary. You could open the remote files directly via fsspec; there is no need to use the warcio utils. To illustrate warcio's S3 support, you could call the warcio CLI directly, without the custom Python script.
Previously this used a local file open. I'll check fsspec.
I was now thinking that Any other suggestions?
@malteos Can
EDIT: Explained by Greg that this can be handled by making the requirement stricter on the whirlwind side ✔️
ac6d444 to 7d99aee
bd16c58 to 157eeec
From https://github.com/commoncrawl/issues/issues/684
This PR adds direct remote access (s3, https) to warc/wet/wat files in S3 buckets, using warcio.
Since 1.8.0, warcio supports direct remote file access over s3 and https: https://github.com/webrecorder/warcio/blob/master/CHANGELIST.rst
This PR adds:
- `fsspec.open` call to replace the local `open` call in `warcio-iterator.py`
- `make iterate-remote` to remotely access the example whirlwind.warc.gz file in the GitHub repo directly over https
- `make cdxj-remote-https` and `make cdxj-remote-s3` to index two EoT WARCs over https and s3
- `make extract-remote-https` and `make extract-remote-s3` to extract records from the two EoT WARCs over https and s3
- `warcio[s3]>=1.8.0`
- `run: make iterate-remote`, `run: make cdxj-remote-https`, `run: make extract-remote-https` (no testing of the s3 versions, which require AWS creds)
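For illustration, here is a minimal sketch of the `fsspec.open` approach. It is not the actual contents of `warcio-iterator.py`: the helper names `open_remote`, `summarize`, and `iterate_warc` are hypothetical, and the script assumes `fsspec` and `warcio` are installed (plus `aiohttp` for https URLs and `s3fs` with valid AWS credentials for s3 URLs).

```python
import sys
from collections import Counter


def open_remote(url):
    """Return an fsspec OpenFile for a local path or a remote URL (https://, s3://)."""
    import fsspec  # third-party; imported lazily so the helpers below stay stdlib-only

    # Binary mode, no fsspec-level decompression: warcio handles gzip itself.
    return fsspec.open(url, "rb")


def summarize(records):
    """Count records by WARC record type (for example 'warcinfo' or 'response')."""
    return Counter(record.rec_type for record in records)


def iterate_warc(url):
    """Stream a WARC/WET/WAT file from `url` and count its records by type."""
    from warcio.archiveiterator import ArchiveIterator  # third-party

    with open_remote(url) as stream:
        return summarize(ArchiveIterator(stream))


if __name__ == "__main__":
    # The URL or path is taken from the command line; none is hard-coded here.
    for rec_type, count in sorted(iterate_warc(sys.argv[1]).items()):
        print(rec_type, count)
```

The same function accepts a local path, an `https://` URL, or an `s3://` URL, since `fsspec` picks the backend from the URL scheme.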